We start by loading the dataset and generating a summary to better understand the data we will be working with.
# Packages used throughout the analysis
library(readxl)        # read_excel
library(ranger)        # random forests
library(vip)           # variable-importance plots
library(gridExtra)     # grid.arrange
library(mgcv)          # gam, gam.check
library(randomForest)
library(DALEX)         # explainers, model_parts, predict_parts, model_profile
library(DALEXtra)      # predict_surrogate (LIME)

concrete <- as.data.frame(read_excel("Concrete_Data.xls"))
DescVars <- names(concrete)
names(concrete) <- c("Cement","Slag","FlyAsh","Water","Superplast","CoarseAggr","FineAggr","Age","Strength")
summary(concrete)
## Cement Slag FlyAsh Water
## Min. :102.0 Min. : 0.0 Min. : 0.00 Min. :121.8
## 1st Qu.:192.4 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.:164.9
## Median :272.9 Median : 22.0 Median : 0.00 Median :185.0
## Mean :281.2 Mean : 73.9 Mean : 54.19 Mean :181.6
## 3rd Qu.:350.0 3rd Qu.:142.9 3rd Qu.:118.27 3rd Qu.:192.0
## Max. :540.0 Max. :359.4 Max. :200.10 Max. :247.0
## Superplast CoarseAggr FineAggr Age
## Min. : 0.000 Min. : 801.0 Min. :594.0 Min. : 1.00
## 1st Qu.: 0.000 1st Qu.: 932.0 1st Qu.:731.0 1st Qu.: 7.00
## Median : 6.350 Median : 968.0 Median :779.5 Median : 28.00
## Mean : 6.203 Mean : 972.9 Mean :773.6 Mean : 45.66
## 3rd Qu.:10.160 3rd Qu.:1029.4 3rd Qu.:824.0 3rd Qu.: 56.00
## Max. :32.200 Max. :1145.0 Max. :992.6 Max. :365.00
## Strength
## Min. : 2.332
## 1st Qu.:23.707
## Median :34.443
## Mean :35.818
## 3rd Qu.:46.136
## Max. :82.599
Next, we split the dataset into 700 samples for training, with the remaining samples allocated for testing.
set.seed(123)
train_samples<- sample(nrow(concrete), 700)
train <- concrete[train_samples, ]
test <- concrete[-train_samples, ]
We will fit a Random Forest model to determine the importance of each variable and its contribution to the response variable.
A split occurs at a decision node of a tree and is chosen to maximize some impurity-reduction metric. The splitting criterion is:
\[C(T) - C(T') = N_rQ_r - (N_{r'}Q_{r'} + N_{r''}Q_{r''})\]
where \(Q_r\) represents the impurity measure (in this case, the variance of the response variable) at node \(r\), \(N_r\) is the number of observations at that node, and \(r'\), \(r''\) are the child nodes produced by the split. This implies that if \(C(T) - C(T')\) is greater than zero, the split is considered justified, as it indicates a reduction in impurity (variance) after the split.
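To make the criterion concrete, here is a minimal sketch (our own illustration, not part of the original analysis) that computes the impurity reduction for a single candidate split, using the variance of the response as \(Q_r\):

```r
# Impurity reduction C(T) - C(T') for one candidate split of node r,
# with Q_r taken to be the variance of the response at the node
impurity_reduction <- function(y, goes_left) {
  Q <- function(v) mean((v - mean(v))^2)  # node impurity: variance
  length(y) * Q(y) -
    (sum(goes_left) * Q(y[goes_left]) + sum(!goes_left) * Q(y[!goes_left]))
}

# Example: split the training responses on the median of Cement;
# a positive value indicates the split reduces impurity and is justified
impurity_reduction(train$Strength, train$Cement <= median(train$Cement))
```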
The importance of a variable in a Random Forest is derived from how much it contributes to reducing impurity across all trees in the forest.
The resulting scores represent the variable importance and indicate how much each feature contributed to the model's performance.
rf_imp <- ranger(formula = Strength ~ ., data = train, importance='impurity')
This approach evaluates the predictive power of each feature by comparing model performance before and after permuting the values of that feature.
Features that significantly degrade performance when permuted are more important because the model relies heavily on them for predictions.
rf_perm <- ranger(formula = Strength ~ ., data = train, importance='permutation')
rf_imp_vip <- vip(rf_imp, num_features = 8)
rf_perm_vip <- vip(rf_perm, num_features = 8)
grid.arrange(rf_imp_vip, rf_perm_vip, ncol=2, top="Left: Reduction in impurity at splits. Right: Out-of-bag permutations")
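ranger computes permutation importance on the out-of-bag samples; the same idea can be sketched by hand (our own illustration) on the test set, measuring how much the RMSE degrades when each feature is shuffled:

```r
# Manual permutation importance: shuffle one column at a time,
# re-predict, and record the increase in test RMSE
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
base_rmse <- rmse(test$Strength, predict(rf_perm, test)$predictions)

perm_importance <- sapply(names(test)[-9], function(v) {
  perm <- test
  perm[[v]] <- sample(perm[[v]])            # break the feature-target link
  rmse(test$Strength, predict(rf_perm, perm)$predictions) - base_rmse
})
sort(perm_importance, decreasing = TRUE)    # larger increase = more important
```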
Results look very similar for both models. The ranking of variable importance is almost the same; the only differences are that Slag and FineAggr, and FlyAsh and CoarseAggr, swap positions between the two models.
Now we will compute the variable importance using Shapley Values.
rf_imp_shapley <- vip(rf_imp, method="shap",
pred_wrapper=yhat, num_features = 8,
newdata=test[,-9], train=train)
rf_perm_shapley <- vip(rf_perm, method="shap",
pred_wrapper=yhat, num_features = 8,
newdata=test[,-9], train=train)
grid.arrange(rf_imp_vip, rf_perm_vip, rf_imp_shapley, rf_perm_shapley,
ncol=2, nrow=2, top="Top left: Impurity. Top right: OOB permutations. \n Bottom left: Shapley values for impurity. Bottom right: Shapley values for OOB permutations")
Looking at the Shapley values, the comparison is almost the same: the order of FlyAsh and CoarseAggr is reversed, as in the previous step, but in this case Slag and FineAggr keep the same order.
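The Shapley values computed by vip with method="shap" are estimated by sampling. A stripped-down sketch of that Monte-Carlo approximation (our own illustration; shapley_one is a hypothetical helper, not a function from any package) is:

```r
# Monte-Carlo Shapley value of feature j for one observation x:
# average the prediction change when j is "switched on" under random
# coalitions drawn from the background data
shapley_one <- function(model, x, j, data, n_mc = 100) {
  p <- ncol(data)
  contrib <- replicate(n_mc, {
    z    <- data[sample(nrow(data), 1), ]      # random background point
    perm <- sample(p)                          # random feature order
    before <- perm[seq_len(which(perm == j))]  # features up to and incl. j
    x_with    <- z; x_with[before]    <- x[before]
    x_without <- x_with; x_without[j] <- z[j]
    predict(model, x_with)$predictions - predict(model, x_without)$predictions
  })
  mean(contrib)
}

# e.g. estimated contribution of Cement (column 1) for the first test row
shapley_one(rf_imp, test[1, -9], 1, train[, -9])
```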
We will follow the same approach by fitting a Linear Model and a Generalized Additive Model to analyze the contributions of the variables.
lm_model <- lm(Strength ~ ., data = train)
summary(lm_model)
##
## Call:
## lm(formula = Strength ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.934 -5.998 0.602 6.881 34.880
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22.163227 30.941577 -0.716 0.474
## Cement 0.113761 0.010020 11.353 < 2e-16 ***
## Slag 0.097243 0.011870 8.193 1.24e-15 ***
## FlyAsh 0.073444 0.014957 4.911 1.13e-06 ***
## Water -0.120030 0.046478 -2.583 0.010 *
## Superplast 0.517791 0.116083 4.461 9.54e-06 ***
## CoarseAggr 0.018545 0.010992 1.687 0.092 .
## FineAggr 0.013261 0.012532 1.058 0.290
## Age 0.121099 0.006842 17.700 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.13 on 691 degrees of freedom
## Multiple R-squared: 0.6252, Adjusted R-squared: 0.6209
## F-statistic: 144.1 on 8 and 691 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_model)
The residuals seem normally distributed, but not homoscedastic. This suggests that applying a logarithm to Strength might stabilize the variance and improve the fit.
lm_log <- lm(log(Strength) ~ ., data = train)
summary(lm_log)
##
## Call:
## lm(formula = log(Strength) ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.51322 -0.22062 0.09024 0.26409 0.74334
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7297821 1.1215704 1.542 0.123462
## Cement 0.0036110 0.0003632 9.942 < 2e-16 ***
## Slag 0.0028969 0.0004303 6.733 3.50e-11 ***
## FlyAsh 0.0027693 0.0005421 5.108 4.21e-07 ***
## Water -0.0030306 0.0016847 -1.799 0.072477 .
## Superplast 0.0163411 0.0042078 3.884 0.000113 ***
## CoarseAggr 0.0004923 0.0003984 1.236 0.217029
## FineAggr 0.0001757 0.0004543 0.387 0.699012
## Age 0.0041021 0.0002480 16.540 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3672 on 691 degrees of freedom
## Multiple R-squared: 0.5628, Adjusted R-squared: 0.5578
## F-statistic: 111.2 on 8 and 691 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_log)
Applying the logarithm to the response variable reduces the variance explained by the model, so we undo the transformation.
gam_model <- gam(Strength ~ s(Age) + s(Cement) + s(Slag) + s(FlyAsh)
+ s(Superplast) + s(CoarseAggr) + s(FineAggr) + s(Water)
, data = train)
summary(gam_model)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Strength ~ s(Age) + s(Cement) + s(Slag) + s(FlyAsh) + s(Superplast) +
## s(CoarseAggr) + s(FineAggr) + s(Water)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.5892 0.2062 172.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Age) 8.400 8.810 250.570 < 2e-16 ***
## s(Cement) 7.934 8.692 38.512 < 2e-16 ***
## s(Slag) 5.958 7.063 23.305 < 2e-16 ***
## s(FlyAsh) 7.503 8.368 6.926 < 2e-16 ***
## s(Superplast) 6.869 7.889 4.166 7.63e-05 ***
## s(CoarseAggr) 1.000 1.000 0.458 0.499
## s(FineAggr) 8.572 8.935 9.130 < 2e-16 ***
## s(Water) 8.512 8.916 17.078 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.89 Deviance explained = 89.9%
## GCV = 32.35 Scale est. = 29.774 n = 700
par(mfrow=c(2,2))
plot(gam_model)
gam.check(gam_model)
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 24 iterations.
## The RMS GCV score gradient at convergence was 5.116767e-06 .
## The Hessian was positive definite.
## Model rank = 73 / 73
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(Age) 9.00 8.40 0.97 0.190
## s(Cement) 9.00 7.93 0.89 <2e-16 ***
## s(Slag) 9.00 5.96 0.93 0.030 *
## s(FlyAsh) 9.00 7.50 0.91 0.005 **
## s(Superplast) 9.00 6.87 0.92 0.015 *
## s(CoarseAggr) 9.00 1.00 0.84 <2e-16 ***
## s(FineAggr) 9.00 8.57 0.82 <2e-16 ***
## s(Water) 9.00 8.51 0.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residuals are normally distributed and homoscedastic. However, the hypothesis test suggests that the basis dimension used to estimate s(Age) might be too low, so we increase it.
gam_model2 <- gam(Strength ~ s(Age, k=14) + s(Cement) + s(Slag) + s(FlyAsh)
+ s(Superplast) + s(CoarseAggr) + s(FineAggr) + s(Water)
, data = train)
gam.check(gam_model2)
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 23 iterations.
## The RMS GCV score gradient at convergence was 1.404314e-05 .
## The Hessian was positive definite.
## Model rank = 75 / 77
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(Age) 13.00 9.77 0.98 0.245
## s(Cement) 9.00 7.96 0.89 <2e-16 ***
## s(Slag) 9.00 6.26 0.93 0.040 *
## s(FlyAsh) 9.00 7.47 0.90 <2e-16 ***
## s(Superplast) 9.00 6.86 0.92 0.015 *
## s(CoarseAggr) 9.00 1.00 0.84 <2e-16 ***
## s(FineAggr) 9.00 8.57 0.82 <2e-16 ***
## s(Water) 9.00 8.60 0.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(2,2))
plot(gam_model2)
The model has not improved, and we cannot increase k any further, since Age only takes 14 distinct values in the training dataset.
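The limit on k can be checked directly, since a smoothing basis cannot use more knots than there are distinct covariate values (a quick check relying on the train split created above):

```r
# Number of distinct curing ages available in the training split
length(unique(train$Age))
```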
Overall, we keep the original full LM and GAM, since they will allow us to compare variable importance across them and the fitted random forest.
We will summarize, again numerically and graphically, the fitted models.
summary(lm_model)
##
## Call:
## lm(formula = Strength ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.934 -5.998 0.602 6.881 34.880
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22.163227 30.941577 -0.716 0.474
## Cement 0.113761 0.010020 11.353 < 2e-16 ***
## Slag 0.097243 0.011870 8.193 1.24e-15 ***
## FlyAsh 0.073444 0.014957 4.911 1.13e-06 ***
## Water -0.120030 0.046478 -2.583 0.010 *
## Superplast 0.517791 0.116083 4.461 9.54e-06 ***
## CoarseAggr 0.018545 0.010992 1.687 0.092 .
## FineAggr 0.013261 0.012532 1.058 0.290
## Age 0.121099 0.006842 17.700 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.13 on 691 degrees of freedom
## Multiple R-squared: 0.6252, Adjusted R-squared: 0.6209
## F-statistic: 144.1 on 8 and 691 DF, p-value: < 2.2e-16
summary(gam_model)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Strength ~ s(Age) + s(Cement) + s(Slag) + s(FlyAsh) + s(Superplast) +
## s(CoarseAggr) + s(FineAggr) + s(Water)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.5892 0.2062 172.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Age) 8.400 8.810 250.570 < 2e-16 ***
## s(Cement) 7.934 8.692 38.512 < 2e-16 ***
## s(Slag) 5.958 7.063 23.305 < 2e-16 ***
## s(FlyAsh) 7.503 8.368 6.926 < 2e-16 ***
## s(Superplast) 6.869 7.889 4.166 7.63e-05 ***
## s(CoarseAggr) 1.000 1.000 0.458 0.499
## s(FineAggr) 8.572 8.935 9.130 < 2e-16 ***
## s(Water) 8.512 8.916 17.078 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.89 Deviance explained = 89.9%
## GCV = 32.35 Scale est. = 29.774 n = 700
par(mfrow=c(2,2))
plot(lm_model)
plot(gam_model)
par(mfrow=c(1,1))
lm_shapley <- vip(lm_model,
method="shap",
pred_wrapper=predict.lm,
num_features = 8,
newdata=test[,-9],
exact=TRUE, train=train)
gam_shapley <- vip(gam_model,
method="shap",
pred_wrapper=predict.gam,
num_features = 8,
newdata=test[,-9],
exact=TRUE, train=train)
grid.arrange(lm_shapley, gam_shapley, ncol=2, top="Left: Shapley values of linear model. \n Right: Shapley values of Generalized additive model")
The Shapley values for the Linear Model (LM) and the Generalized Additive Model (GAM) yield different insights compared to the Random Forests. In the LM and GAM, the Cement variable stands out as the most significant predictor, whereas in the Random Forests, Age plays a more prominent role. However, across all models, Cement is highlighted as the first or second most important variable, whereas the importance of the other variables can vary a little depending on the model.
source("relev.ghost.var.R")
Rel_Gh_Var_lm <- relev.ghost.var(model= lm_model,
newdata = test[,-9],
y.ts = test$Strength,
func.model.ghost.var = lm
)
par(cex.main = 1.7) # Increase the eigenvalues plot's title size
plot.relev.ghost.var(Rel_Gh_Var_lm,n1=700,ncols.plot = 3)
aux <- cbind(Rel_Gh_Var_lm$relev.ghost,lm_shapley$data$Importance)
plot(aux[,1],aux[,2],col=0,xlab="Relev. by Ghost Variables",ylab="Shapley Var. Imp.")
text(aux[,1],aux[,2],row.names(aux))
From the plot, we can observe that, using ghost variable importance, age is the most relevant explanatory variable, followed by cement and slag. These first two components account for a significant portion of the variability in the data, with the first and second eigenvalues explaining 64% and 30% of the variability, respectively. In the first component, most of the relevance is attributed to age, while in the second component, cement and slag dominate.
When comparing these findings with the Shapley values, we notice a shift in relevance: Age, which ranked low under the Shapley values, is now the most important variable. Cement and Slag remain relevant, but with slightly reduced importance compared to the Shapley values. The relevance of the remaining variables appears to stay relatively consistent between the two methods.
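For intuition, the core of the ghost-variable relevance can be sketched in a few lines (our own simplified, un-normalised version; relev.ghost.var additionally scales these quantities and analyses their eigenstructure):

```r
# Relevance of a variable by ghost substitution: replace the variable by
# its linear prediction from the remaining covariates, and measure the
# mean squared change in the model's predictions
ghost_relevance <- function(model, newdata, var) {
  ghost_fit <- lm(reformulate(setdiff(names(newdata), var), response = var),
                  data = newdata)
  ghosted <- newdata
  ghosted[[var]] <- predict(ghost_fit, newdata)
  mean((predict(model, newdata) - predict(model, ghosted))^2)
}

# Un-normalised ghost relevances for the linear model on the test features
sapply(names(test)[-9], function(v) ghost_relevance(lm_model, test[, -9], v))
```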
Rel_Gh_Var_gam <- relev.ghost.var(model= gam_model,
newdata = test[,-9],
y.ts = test$Strength,
func.model.ghost.var = lm
)
par(cex.main = 1.7) # Increase the eigenvalues plot's title size
plot.relev.ghost.var(Rel_Gh_Var_gam,n1=700,ncols.plot = 3)
aux <- cbind(Rel_Gh_Var_gam$relev.ghost,gam_shapley$data$Importance)
plot(aux[,1],aux[,2],col=0,xlab="Relev. by Ghost Variables",ylab="Shapley Var. Imp.")
text(aux[,1],aux[,2],row.names(aux))
From the plot, we can see that, using ghost variable importance, age remains the most relevant explanatory variable. The first component explains 65% of the variability, with the majority of its relevance attributed to age. The second component accounts for 20% of the variability, with relevance primarily attributed to Cement and FlyAsh, and slightly less to Slag.
When comparing this with the Shapley values, age remains the most significant variable. However, FlyAsh appears to have more importance in the ghost variable analysis than Cement and Slag, which, in contrast, are attributed more importance by the Shapley values.
rf_concrete = randomForest(Strength ~ ., data=train)
Rel_Gh_Var_rf <- relev.ghost.var(model=rf_concrete,
newdata = test[,-9],
y.ts = test$Strength,
func.model.ghost.var = lm
)
par(cex.main = 1.7) # Increase the eigenvalues plot's title size
plot.relev.ghost.var(Rel_Gh_Var_rf,n1=700,ncols.plot = 3)
aux <- cbind(Rel_Gh_Var_rf$relev.ghost,rf_imp_shapley$data$Importance)
plot(aux[,1],aux[,2],col=0,xlab="Relev. by Ghost Variables",ylab="Shapley Var. Imp.")
text(aux[,1],aux[,2],row.names(aux))
Age seems to be the most important variable: the first component explains 88% of the variability, and nearly all of its relevance is brought by Age. The second component explains nearly 7% of the variability, with its importance attributed mainly to Cement and Slag. Compared to the Shapley values, Age is the most relevant variable by ghost-variable importance, whereas it sits at the bottom of the Shapley importance ranking; Cement and Slag seem important under both methods.
Next, we create a DALEX explainer for the random forest.
explainer_rf <- explain.default(model = rf_imp,
data = test[,-9],
y = test$Strength,
label = "Random Forest")
## Preparation of a new explainer is initiated
## -> model label : Random Forest
## -> data : 330 rows 8 cols
## -> target variable : 330 values
## -> predict function : yhat.ranger will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package ranger , ver. 0.17.0 , task regression ( default )
## -> predicted values : numerical, min = 8.363111 , mean = 36.30909 , max = 75.63867
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -20.31393 , mean = -0.006253068 , max = 25.08166
## A new explainer has been created!
Rnd_Perm <- model_parts(
explainer_rf,
N = NULL, # All available data are used
B = 10 # number of permutations to be used, with B = 10 used by default
)
Rnd_Perm
## variable mean_dropout_loss label
## 1 _full_model_ 5.934092 Random Forest
## 2 FlyAsh 6.659445 Random Forest
## 3 CoarseAggr 6.667983 Random Forest
## 4 FineAggr 7.148863 Random Forest
## 5 Slag 7.276549 Random Forest
## 6 Superplast 7.997304 Random Forest
## 7 Water 9.310147 Random Forest
## 8 Cement 11.550053 Random Forest
## 9 Age 14.198205 Random Forest
## 10 _baseline_ 22.010103 Random Forest
plot(Rnd_Perm)
aux.plot <- plot(Rnd_Perm)
dropout_loss.y <- Rnd_Perm$dropout_loss[1]
aux.I <- order(-aux.plot$data$dropout_loss.x)
rf_perm_DALEX_as_vi <- tibble::tibble(aux.plot$data[aux.I,c(2,4)])
class(rf_perm_DALEX_as_vi) <- c("vi", class(rf_perm_DALEX_as_vi))
names(rf_perm_DALEX_as_vi) <- c("Variable", "Importance")
rf_perm_DALEX_as_vi$Importance <-
rf_perm_DALEX_as_vi$Importance - dropout_loss.y
# Creating the ggplot:
rf_perm_DALEX_vip <- vip(rf_perm_DALEX_as_vi)
grid.arrange(rf_imp_vip, rf_perm_vip,
rf_perm_DALEX_vip, ncol=2, nrow=2,
top="Top left: Impurity. Top right: oob permutations. Bottom left: test sample permutations"
)
All three feature plots produce results that are consistent with one another, with the test sample permutation showing a closer resemblance to the OOB permutation.
PDP_rf <- model_profile(
explainer=explainer_rf,
variables = NULL, # All variables are used
N = NULL, # All available data are used
groups = NULL,
k = NULL,
center = TRUE,
type = "partial" # partial, conditional or accumulated
)
plot(PDP_rf, facet_ncol=2)
CDP_rf <- model_profile(
explainer=explainer_rf,
variables = NULL, # All variables are used
N = NULL, # All available data are used
groups = NULL,
k = NULL,
center = TRUE,
type = "conditional" # partial, conditional or accumulated
)
plot(CDP_rf, facet_ncol=2)
Conditional dependence is similar to partial dependence but reveals some nuanced differences. For instance, Superplast exhibits a sharper increasing dependency on the target variable, indicating its significant contribution at higher values. Similarly, CoarseAggr demonstrates a more pronounced decreasing dependency, implying a stronger negative impact as its value increases. Lastly, the relationship between Cement and the target variable appears to be more linear, suggesting a consistent influence across its range.
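For reference, a partial-dependence value is just the average prediction with one feature clamped to a grid value; a minimal, uncentered sketch for Age (our own illustration, unlike the centered profiles produced by model_profile above) is:

```r
# Manual partial-dependence profile for Age: for each grid value a,
# set Age = a on every test observation and average the predictions,
# marginalising over the remaining features
age_grid <- seq(min(test$Age), max(test$Age), length.out = 20)
pdp_age <- sapply(age_grid, function(a) {
  d <- test[, -9]
  d$Age <- a
  mean(predict(rf_imp, d)$predictions)
})
plot(age_grid, pdp_age, type = "l",
     xlab = "Age", ylab = "Average prediction (PDP)")
```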
Now, we are interested in explaining the weakest and strongest concretes in the test dataset.
local_low <- test[which.min(test$Strength), ]
local_high <- test[which.max(test$Strength), ]
shap_low <- predict_parts(explainer=explainer_rf
, new_observation=local_low
, type="shap")
shap_high <- predict_parts(explainer=explainer_rf
, new_observation=local_high
, type="shap")
shap_low
## min q1 median
## Random Forest: Age = 3 -13.3202610 -12.4934376 -11.9042256
## Random Forest: Cement = 108.3 -9.9072352 -8.4276250 -8.0929361
## Random Forest: CoarseAggr = 938.2 -0.7558079 -0.6006756 -0.0982932
## Random Forest: FineAggr = 849 -3.1922013 -2.3517140 -2.1107560
## Random Forest: FlyAsh = 0 -0.7713527 -0.2302253 0.1858270
## Random Forest: Slag = 162.4 -1.2426721 0.5253203 0.8709904
## Random Forest: Superplast = 0 -5.0589186 -3.2178575 -2.6077533
## Random Forest: Water = 203.5 -5.3702437 -4.8750094 -4.1984289
## mean q3 max
## Random Forest: Age = 3 -11.8290053 -11.4294416 -9.6588984
## Random Forest: Cement = 108.3 -7.9283065 -7.4237452 -6.0221260
## Random Forest: CoarseAggr = 938.2 -0.1580071 0.1208928 0.4116251
## Random Forest: FineAggr = 849 -2.1606687 -1.9265369 -1.6104835
## Random Forest: FlyAsh = 0 0.2219229 0.9014332 1.2210993
## Random Forest: Slag = 162.4 0.9301561 1.8981939 2.2844444
## Random Forest: Superplast = 0 -2.8316599 -2.0377832 -1.3882451
## Random Forest: Water = 203.5 -4.1904128 -3.7691978 -2.2508298
plot(shap_low)
shap_high
## min q1 median mean
## Random Forest: Age = 91 7.2574873 8.7627443 9.310507 9.201861
## Random Forest: Cement = 389.9 4.9079839 6.3844236 7.192446 7.401717
## Random Forest: CoarseAggr = 944.7 0.4968657 0.7495974 1.271960 1.560704
## Random Forest: FineAggr = 755.8 1.0275997 1.4705977 1.760566 1.870738
## Random Forest: FlyAsh = 0 0.9630022 1.3074026 1.735142 1.875402
## Random Forest: Slag = 189 2.7208130 2.9332063 3.513683 4.103456
## Random Forest: Superplast = 22 3.2328063 3.9284667 4.676292 4.805212
## Random Forest: Water = 145.9 5.8082881 6.4046618 6.639292 6.618314
## q3 max
## Random Forest: Age = 91 9.889640 10.597548
## Random Forest: Cement = 389.9 8.945368 9.582976
## Random Forest: CoarseAggr = 944.7 2.154907 3.466131
## Random Forest: FineAggr = 755.8 2.308660 2.881110
## Random Forest: FlyAsh = 0 2.273406 2.956602
## Random Forest: Slag = 189 5.118516 6.583485
## Random Forest: Superplast = 22 5.507521 6.869975
## Random Forest: Water = 145.9 6.856756 7.346905
plot(shap_high)
According to SHAP, the weakest concrete in the test dataset is weak for the following reasons, in decreasing order of relevance: it is young, and it has little cement, a lot of water, little Superplast and a lot of FineAggr.
On the other hand, all concrete features strengthen the strongest concrete in the dataset. The most important ones are Age (old), Cement (a lot), Water (little), Superplast (a lot) and Slag (a lot). However, the importance of some of these features has a large standard error, namely Cement, Superplast and Slag, so their position in the relevance ranking might actually be lower or higher.
bd_low <- predict_parts(explainer_rf
, new_observation=local_low
, type="break_down")
bd_high <- predict_parts(explainer_rf
, new_observation=local_high
, type="break_down")
bd_low
## contribution
## Random Forest: intercept 36.309
## Random Forest: Age = 3 -12.408
## Random Forest: Cement = 108.3 -5.679
## Random Forest: Water = 203.5 -2.401
## Random Forest: Slag = 162.4 -0.040
## Random Forest: Superplast = 0 -3.960
## Random Forest: FineAggr = 849 -2.077
## Random Forest: FlyAsh = 0 -0.719
## Random Forest: CoarseAggr = 938.2 -0.663
## Random Forest: prediction 8.363
plot(bd_low)
bd_high
## contribution
## Random Forest: intercept 36.309
## Random Forest: Age = 91 9.373
## Random Forest: Water = 145.9 6.118
## Random Forest: Cement = 389.9 4.908
## Random Forest: Superplast = 22 3.641
## Random Forest: Slag = 189 4.995
## Random Forest: FineAggr = 755.8 2.192
## Random Forest: FlyAsh = 0 2.746
## Random Forest: CoarseAggr = 944.7 3.466
## Random Forest: prediction 73.746
plot(bd_high)
The break-down plot for the weakest concrete disagrees with the SHAP estimates on the sign of the contributions of Slag and FlyAsh, since this plot shows that they make this concrete weaker, even if only slightly (-0.040 MPa and -0.719 MPa, respectively). We can also see how the total contribution of all predictive variables reduces the observation's Strength from the 36.309 intercept to the 8.363 prediction.
As for the strongest concrete, the break-down plot's interpretation is also similar to SHAP's, but the importance of some features changes. For example, now Water is more important than Cement, Superplast is less relevant than Slag, and CoarseAggr contributes more than FineAggr.
Overall, we should be more confident in SHAP's results, because break-down contributions depend on the order of the explanatory variables, which is a clear downside of this local IML method.
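This order dependence is easy to verify, since predict_parts() forwards an order argument to the underlying break-down computation. A sketch, with one arbitrarily chosen order and its reverse, shows the per-variable attributions shift while the final prediction stays fixed:

```r
# Break-down attributions under two opposite variable orders
ord1 <- c("Age", "Cement", "Water", "Slag",
          "Superplast", "FineAggr", "FlyAsh", "CoarseAggr")
bd_a <- predict_parts(explainer_rf, new_observation = local_low,
                      type = "break_down", order = ord1)
bd_b <- predict_parts(explainer_rf, new_observation = local_low,
                      type = "break_down", order = rev(ord1))
# Same intercept and prediction rows, different per-variable contributions
plot(bd_a); plot(bd_b)
```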
lime_low <- predict_surrogate(explainer_rf
, new_observation=local_low
, type="localModel")
lime_high <- predict_surrogate(explainer_rf
, new_observation=local_high
, type="localModel")
lime_low
## estimated variable original_variable dev_ratio response
## 1 36.3090920 (Model mean) 0.485556
## 2 45.6673191 (Intercept) 0.485556
## 3 -11.4028739 Cement <= 238.27 Cement 0.485556
## 4 0.0502877 FlyAsh <= 94.11 FlyAsh 0.485556
## 5 -5.6135248 Water > 181.88 Water 0.485556
## 6 -1.5576940 Superplast <= 6 Superplast 0.485556
## predicted_value model
## 1 8.363111 Random Forest
## 2 8.363111 Random Forest
## 3 8.363111 Random Forest
## 4 8.363111 Random Forest
## 5 8.363111 Random Forest
## 6 8.363111 Random Forest
plot(lime_low)
lime_high
## estimated variable original_variable dev_ratio response
## 1 36.3090920 (Model mean) 0.3957745
## 2 33.4424892 (Intercept) 0.3957745
## 3 9.1400776 Cement > 350 Cement 0.3957745
## 4 1.8299698 Slag > 161.49 Slag 0.3957745
## 5 0.9045521 FlyAsh <= 94 FlyAsh 0.3957745
## 6 5.9824659 Water <= 166.03 Water 0.3957745
## predicted_value model
## 1 73.7465 Random Forest
## 2 73.7465 Random Forest
## 3 73.7465 Random Forest
## 4 73.7465 Random Forest
## 5 73.7465 Random Forest
## 6 73.7465 Random Forest
plot(lime_high)
LIME's explanation of the weakest concrete coincides with SHAP's except for Age and FineAggr, which do not appear in it. Specifically, the method has identified three properties of this observation that make it weak: Cement <= 238.27, Water > 181.88 and Superplast <= 6 (in decreasing order of relevance), while FlyAsh <= 94.11 has a negligible contribution in the opposite direction.
In the strongest concrete, LIME's results do not mention Age's role either. Furthermore, CoarseAggr, even if not very relevant in SHAP's or break-down's results, does not appear in LIME's. The method shows that this concrete's strength is due to having a lot of cement (Cement > 350), little water (Water <= 166.03), a lot of slag (Slag > 161.49) and little fly ash (FlyAsh <= 94).
cp_low <- predict_profile(explainer_rf, new_observation=local_low)
cp_high <- predict_profile(explainer_rf, new_observation=local_high)
plot(cp_low, facet_ncol=2)
plot(cp_high, facet_ncol=2)
With ICE, we can better understand why the weakest concrete is so weak. We can see that, ceteris paribus, strong concretes are at least 100 days old and that, before then, strength is acquired very quickly; hence, ceteris paribus, a concrete that is 3 days old is very weak. The effect of cement also seems to grow roughly linearly (from now on, all statements regarding the effect of predictor variables on the response are assumed ceteris paribus), so the concrete gains a lot of strength with large quantities of it, but this is not the case for the weakest concrete. Other relevant features are FineAggr (which barely strengthens the concrete when FineAggr > 760 \(kg/m^3\)), Superplast (whose addition strengthens the concrete until there is about 12 kg in a \(m^3\) of mixture) and Water, which shows a local minimum around 200 \(kg/m^3\), precisely the weakest concrete's concentration.
The strongest concrete's ICE profiles look similar but displaced. Moreover, this observation's predictive variables are found in neighborhoods of local maxima, hence maximizing the strengthening effect of every component in the concrete mixture. Regarding the constant displacement with respect to the weakest observation, it is more noticeable in features with flat slopes, such as CoarseAggr, FlyAsh or Slag, whose curves shift from around 10 up to about 70 MPa. This shows the importance of the ceteris paribus assumption, since the ICE plot of a predictive variable changes with the observation's values in the rest of the variables.
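Each ceteris-paribus curve is obtained by varying one feature of a single observation while holding the rest fixed; a minimal sketch for Age on the weakest concrete (our own illustration, equivalent in spirit to predict_profile) is:

```r
# Manual ceteris-paribus (ICE) curve for Age on the weakest observation:
# vary Age over a grid while keeping all other features at their values
age_grid <- seq(min(test$Age), max(test$Age), length.out = 50)
cp_age <- sapply(age_grid, function(a) {
  obs <- local_low[, -9]
  obs$Age <- a
  predict(rf_imp, obs)$predictions
})
plot(age_grid, cp_age, type = "l", xlab = "Age",
     ylab = "Prediction for the weakest concrete")
```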
age_rf <- model_profile(explainer_rf, variables="Age", N=100, type="partial")
plot(age_rf, geom = "profiles") +
ggtitle("Ceteris-paribus and partial-dependence profiles for Age")
We can see how the previously described ICE plot of Age is similar for all observations in the test dataset and, therefore, consistent with the PDP: there is a steep increase from Age = 0 to Age = 80, after which the curve remains almost constant. Between different observations, however, we notice a roughly constant displacement; the lowest observation is about 25 MPa weaker than the average (PDP) and the highest is about 20 MPa stronger.